SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation
Estimating 3D orientation and translation of objects is essential for
infrastructure-less autonomous navigation and driving. In the case of monocular
vision, successful methods have mainly relied on two ingredients: (i) a
network generating 2D region proposals, and (ii) an R-CNN structure predicting 3D
object pose by utilizing the acquired regions of interest. We argue that the 2D
detection network is redundant and introduces non-negligible noise for 3D
detection. Hence, in this paper we propose a novel 3D object detection method,
named SMOKE, that predicts a 3D bounding box for each detected object by
combining a single keypoint estimate with regressed 3D variables. As a second
contribution, we propose a multi-step disentangling approach for constructing
the 3D bounding box, which significantly improves both training convergence and
detection accuracy. In contrast to previous 3D detection techniques, our method
does not require complicated pre/post-processing, extra data, and a refinement
stage. Despite its structural simplicity, our proposed SMOKE network
outperforms all existing monocular 3D detection methods on the KITTI dataset,
giving the best state-of-the-art result on both the 3D object detection and
bird's-eye-view evaluations. The code will be made publicly available.
Comment: 8 pages, 6 figures
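To make the keypoint-plus-regression idea concrete, the following is a minimal sketch of building a 3D box from a single projected-center keypoint, a regressed depth, regressed dimensions, and a regressed yaw. The function name, the pinhole back-projection, and the KITTI-style corner layout are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def box3d_from_keypoint(K, keypoint, depth, dims, yaw):
    """Sketch of SMOKE-style 3D box construction: back-project a 2D
    keypoint (the projected 3D center) through the camera intrinsics K
    at the regressed depth, then place a box of regressed dimensions
    (h, w, l) rotated by the regressed yaw. Illustrative names only."""
    u, v = keypoint
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project the keypoint at the regressed depth (pinhole model).
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    center = np.array([x, y, depth])
    h, w, l = dims
    # Eight box corners in the object frame (y pointing down, KITTI-style).
    xs = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    ys = np.array([ 0,  0,  0,  0, -h, -h, -h, -h], dtype=float)
    zs = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    corners = np.stack([xs, ys, zs])                   # (3, 8)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])   # rotation about y
    return R @ corners + center[:, None]               # (3, 8), camera frame
```

In a full pipeline the keypoint comes from a heatmap head and depth/dims/yaw from regression heads; the multi-step disentangling contribution concerns how the loss is applied to this construction, which this sketch does not cover.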
Graph-Segmenter: Graph Transformer with Boundary-aware Attention for Semantic Segmentation
Transformer-based semantic segmentation approaches, which divide the
image into regions by sliding windows and model the relations inside
each window, have achieved outstanding success. However, relation
modeling between windows was not the primary emphasis of previous work and
thus was not fully exploited. To address this issue, we propose Graph-Segmenter,
including a Graph Transformer and a Boundary-aware Attention module, which is
an effective network for simultaneously modeling the deeper relations
between windows from a global view and between pixels inside each window from a
local view, while performing low-cost boundary refinement. Specifically, we
treat every window and pixel inside the window as nodes to construct graphs for
both views and devise the Graph Transformer. The introduced boundary-aware
attention module refines the edge information of the target objects by
modeling the relationships between pixels along the object's edge. Extensive
experiments on three widely used semantic segmentation datasets (Cityscapes,
ADE-20k and PASCAL Context) demonstrate that our proposed network, a Graph
Transformer with Boundary-aware Attention, can achieve state-of-the-art
segmentation performance.
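The window-level relation modeling the abstract describes can be sketched as follows: pool each sliding window into a node vector, run attention over all window nodes (a fully connected graph over windows), and broadcast each updated node back to its window. This is a minimal NumPy sketch with no learned projections; all names are our own, not the paper's.

```python
import numpy as np

def window_graph_attention(feat, win):
    """Pool win x win windows of feat (H, W, C) into graph nodes, apply
    scaled dot-product attention among the nodes, and add the updated
    node back to every pixel of its window as a residual."""
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0
    gh, gw = H // win, W // win
    # Node features: mean-pool each window -> (gh*gw, C).
    nodes = feat.reshape(gh, win, gw, win, C).mean(axis=(1, 3)).reshape(-1, C)
    # Single-head attention among window nodes (softmax over rows).
    scores = nodes @ nodes.T / np.sqrt(C)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    updated = attn @ nodes                                 # (gh*gw, C)
    # Broadcast each updated node back over its window as a residual.
    up = updated.reshape(gh, 1, gw, 1, C)
    return feat + np.broadcast_to(up, (gh, win, gw, win, C)).reshape(H, W, C)
```

The paper's Graph Transformer additionally builds a pixel-level graph inside each window and uses learned projections; this sketch only illustrates the window-as-node idea.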
ADD: An Automatic Desensitization Fisheye Dataset for Autonomous Driving
Autonomous driving systems require many images for analyzing the surrounding
environment. However, these captured images offer little protection for private
information, such as pedestrian faces and vehicle license
plates, which has become a significant issue. In this paper, in response to the
call for data security laws and regulations and based on the advantages of
the large field of view (FoV) of the fisheye camera, we build the first Autopilot
Desensitization Dataset, called ADD, and formulate the first
deep-learning-based image desensitization framework, to promote the study of
image desensitization in autonomous driving scenarios. The compiled dataset
consists of 650K images, including different face and vehicle license plate
information captured by the surround-view fisheye camera. It covers various
autonomous driving scenarios, including diverse facial characteristics and
license plate colors. Then, we propose an efficient multitask desensitization
network called DesCenterNet as a benchmark on the ADD dataset, which can
perform face and vehicle license plate detection and desensitization tasks.
Based on ADD, we further provide an evaluation criterion for desensitization
performance, and extensive comparison experiments verify the
effectiveness and superiority of our method on image desensitization.
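The detect-then-desensitize idea can be illustrated with a minimal sketch: given detected face / license-plate boxes, replace each region with its mean color so the private content is unrecoverable. In the paper a single network (DesCenterNet) performs both detection and desensitization; here the boxes are assumed inputs, and the mean-fill is a stand-in for whatever anonymization operator is used.

```python
import numpy as np

def desensitize(image, boxes):
    """Replace each (x0, y0, x1, y1) region of an H x W x C image with
    its mean color. A stand-in for the anonymization step that would
    follow face / license-plate detection."""
    out = image.astype(np.float64).copy()
    for x0, y0, x1, y1 in boxes:
        # Mean over the region's spatial axes, broadcast back over it.
        out[y0:y1, x0:x1] = out[y0:y1, x0:x1].mean(axis=(0, 1))
    return out
```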
LineMarkNet: Line Landmark Detection for Valet Parking
We aim for accurate and efficient line landmark detection for valet parking,
which is a long-standing yet unsolved problem in autonomous driving. To this
end, we present a deep line landmark detection system where we carefully design
the modules to be lightweight. Specifically, we first empirically design four
general line landmarks including three physical lines and one novel mental
line. The four line landmarks are effective for valet parking. We then develop
a deep network (LineMarkNet) to detect line landmarks from surround-view
cameras. Via the pre-calibrated homography, we fuse context from the four
separate cameras into a unified bird's-eye-view (BEV) space; specifically, we
fuse the surround-view features with the BEV features and then employ a
multi-task decoder to detect multiple line landmarks. We apply a center-based
strategy for the object detection task, and design a graph transformer that
enhances the vision transformer with hierarchical graph reasoning for the
semantic segmentation task. Finally, we further parameterize the detected line landmarks
(e.g., intercept-slope form) whereby a novel filtering backend incorporates
temporal and multi-view consistency to achieve smooth and stable detection.
Moreover, we annotate a large-scale dataset to validate our method.
Experimental results show that our deep line landmark detection framework
achieves improved performance compared with several line detection methods, and
validate the multi-task network's efficiency for real-time line landmark
detection on the Qualcomm 820A platform while maintaining superior accuracy.
Comment: 29 pages, 12 figures
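The pre-calibrated homography fusion step above can be sketched in a few lines: a 3x3 homography maps pixel coordinates from one surround-view camera onto the shared BEV ground plane. The function below is a generic planar homography warp under the assumption of offline calibration, not the paper's implementation.

```python
import numpy as np

def warp_points_to_bev(H, pts):
    """Map (N, 2) pixel coordinates into the BEV plane with a 3x3
    camera-to-BEV homography H (assumed given by offline calibration)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coords
    bev = pts_h @ H.T
    return bev[:, :2] / bev[:, 2:3]                    # perspective divide
```

Warping dense feature maps instead of points works the same way, sampling each BEV cell's source pixel through the inverse homography.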
Complete Solution for Vehicle Re-ID in Surround-view Camera System
Vehicle re-identification (Re-ID) is a critical component of the autonomous
driving perception system, and research in this area has accelerated in recent
years. However, there is yet no perfect solution to the vehicle
re-identification issue associated with the car's surround-view camera system.
Our analysis identifies two significant issues in this scenario: i) it is
difficult to identify the same vehicle across many picture frames due to
the unique construction of the fisheye camera; ii) the appearance of the same
vehicle seen through the surround-view system's several cameras is rather
different. To overcome these issues, we propose an integrative vehicle Re-ID
solution method. On the one hand, we provide a technique for determining the
consistency of the tracking box drift with respect to the target. On the other
hand, we combine a Re-ID network based on the attention mechanism with spatial
limitations to increase performance in situations involving multiple cameras.
Finally, our approach combines state-of-the-art accuracy with real-time
performance. We will soon make the source code and annotated fisheye dataset
available.
Comment: 11 pages, 10 figures. arXiv admin note: substantial text overlap with arXiv:2006.1650
Surround-view Fisheye BEV-Perception for Valet Parking: Dataset, Baseline and Distortion-insensitive Multi-task Framework
Surround-view fisheye perception under valet parking scenes is fundamental
and crucial in autonomous driving. Environmental conditions in parking lots
differ from those in common public datasets, with imperfect lighting and
opacity that substantially impact perception performance. Most
existing networks trained on public datasets may therefore generalize poorly to
these valet parking scenes, which are further affected by fisheye distortion. In this
article, we introduce a new large-scale fisheye dataset called Fisheye Parking
Dataset (FPD) to promote research on dealing with diverse real-world
surround-view parking cases. Notably, our compiled FPD exhibits excellent
characteristics for different surround-view perception tasks. In addition, we
also propose our real-time distortion-insensitive multi-task framework Fisheye
Perception Network (FPNet), which improves the surround-view fisheye BEV
perception by enhancing the fisheye distortion operation and multi-task
lightweight designs. Extensive experiments validate the effectiveness of our
approach and the dataset's exceptional generalizability.
Comment: 12 pages, 11 figures
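Why fisheye distortion needs dedicated handling can be seen by comparing the standard equidistant fisheye projection with a pinhole model: the fisheye image radius grows only linearly with the incidence angle, which is what makes a very large FoV possible. This comparison is background material, not the paper's distortion operation.

```python
import numpy as np

def fisheye_radius(theta, f):
    """Equidistant fisheye model: r = f * theta. The radius grows
    linearly with the incidence angle, so theta can approach pi/2
    (and beyond) while staying on the sensor."""
    return f * theta

def pinhole_radius(theta, f):
    """Pinhole model for comparison: r = f * tan(theta), which diverges
    as theta -> pi/2, so a pinhole camera cannot reach a 180-degree FoV."""
    return f * np.tan(theta)
```

The two models agree for small angles and diverge sharply toward the image border, which is exactly where fisheye-specific designs matter most.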
ADU-Depth: Attention-based Distillation with Uncertainty Modeling for Depth Estimation
Monocular depth estimation is challenging due to its inherent ambiguity and
ill-posed nature, yet it is quite important to many applications. While recent
works achieve only limited accuracy with increasingly complicated networks
that extract features bearing limited spatial geometric cues from a single RGB
image, we instead introduce spatial cues by training a teacher network that
leverages left-right image pairs as inputs and transferring the learned 3D
geometry-aware knowledge to the monocular student network. Specifically, we
present a novel knowledge distillation framework, named ADU-Depth, with the
goal of leveraging the well-trained teacher network to guide the learning of
the student network, thus boosting the precise depth estimation with the help
of extra spatial scene information. To enable domain adaptation and ensure
effective and smooth knowledge transfer from teacher to student, we apply both
attention-adapted feature distillation and focal-depth-adapted response
distillation in the training stage. In addition, we explicitly model the
uncertainty of depth estimation to guide distillation in both feature space and
result space to better produce 3D-aware knowledge from monocular observations
and thus enhance the learning for hard-to-predict image regions. Our extensive
experiments on the real depth estimation datasets KITTI and DrivingStereo
demonstrate the effectiveness of the proposed method, which ranked 1st on the
challenging KITTI online benchmark.
Comment: accepted by CoRL 202
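The uncertainty-guided distillation described above can be sketched with the standard heteroscedastic weighting: the per-pixel gap between student and teacher depth is down-weighted where the predicted uncertainty is high, with a regularizer that penalizes declaring everything uncertain. This is a generic form in the spirit of the abstract, not necessarily the paper's exact loss.

```python
import numpy as np

def uncertainty_weighted_distill_loss(d_student, d_teacher, log_var):
    """Response-distillation loss with per-pixel uncertainty: each
    residual |d_s - d_t| is scaled by exp(-log_var), and log_var itself
    is added so predicting high uncertainty everywhere is penalized."""
    resid = np.abs(d_student - d_teacher)
    return float(np.mean(np.exp(-log_var) * resid + log_var))
```

In training, `log_var` would come from an extra prediction head; hard-to-predict regions then receive large `log_var` and contribute less to the distillation term.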